I. THE MULTICACHE-CONSISTENCY PROBLEM: AN INTRODUCTION


A "consistency problem" refers, in general, to a situation
where two or more entries representing the same "fact" in a data
base differ (i.e., are inconsistent). This, of course, can only
occur when redundancy exists. In this paper we will be concerned
with the "consistency problem" that arises in cache-based memory
systems.

A cache memory system (Kaplan and Winder, 1973) represents a
type of memory hierarchy that attempts to bridge the CPU-main memory
speed gap by the use of a small, high speed random access memory
whose cost per bit is higher than that of main memory, but whose
total cost is relatively small because of the small size.

Conceptually, this configuration has analogies with paging systems
(Matick, 1977). The implementations, however, are far apart because
of speed considerations. In contrast to a paging system, a cache is
managed by hardware algorithms, provides a smaller ratio of memory
access times (e.g., 10:1 rather than 1000:1), and deals with smaller
blocks of data (64 bytes, for example, rather than 4096).

In  a  cache  system,  all  data  are  referenced  by  their  main  memory 
address.  At  any  given  time,  a  certain  subset  of  the  contents  of 
main  memory  is  contained  in  the  cache  level.  If  a  processor  then 
requests  a  data  item  in  this  subset,  the  request  is  serviced  at  the 
cache  level. 

A  cache-based  system  works  "well"  for  two  basic  reasons: 

First, executing programs tend to re-use instructions and data; and
second, programs tend to use instructions and data near recently
used instructions and data. The first property means that once
information is fetched from main memory to cache, subsequent
accesses to it are at cache speed. The second property means that
if a request to main memory is satisfied by bringing into a cache a
block of information larger than is immediately needed, the
additional information is likely to be needed soon, and its presence
in the cache will save one or more references to main memory.

In this paper we will use the term "cache line" for the block of
information that is both the minimum amount of data which may be
transmitted between the cache and main memory and the allocation
unit in the cache. All bytes of a cache line are, therefore, either
all present in or all absent from the cache. A directory is usually
used to record the main memory addresses of all lines in the cache.
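
The cache line and directory just described can be sketched as follows. This is a minimal illustration, not a description of any particular machine: the class name `Cache`, the unbounded capacity, and the absence of a replacement policy are assumptions made for brevity. Reading 256 consecutive bytes with 64-byte lines shows the effect of the second locality property: only the first byte of each line causes a transfer from main memory.

```python
LINE_SIZE = 64  # bytes per cache line (the example figure used in the text)


class Cache:
    """Toy cache whose directory records which main-memory lines are present.

    Assumption for brevity: unlimited capacity, so no line is ever replaced.
    """

    def __init__(self):
        self.directory = {}  # line address -> bytes held for that line
        self.hits = 0
        self.misses = 0

    def read(self, memory, address):
        line_address = address - (address % LINE_SIZE)
        if line_address not in self.directory:
            # Miss: the whole line is transferred, never a part of it.
            self.misses += 1
            self.directory[line_address] = memory[line_address:line_address + LINE_SIZE]
        else:
            self.hits += 1
        return self.directory[line_address][address % LINE_SIZE]


memory = bytes(range(256))          # a tiny "main memory" of 256 bytes
cache = Cache()
values = [cache.read(memory, a) for a in range(256)]  # sequential scan
print(cache.misses, cache.hits)
```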


It is easy to demonstrate how inconsistency can develop in
cache-based systems due to the existence of redundancy. Consider
first the simple case of a single-cache organization (Figure 1(a)).

The CPU can only access words that are in the cache. If a word
that is needed for processing is not already in the cache, it will
first have to be transferred from main memory to the cache. Once
the word is in the cache it becomes accessible by the CPU and
processing can take place. If the CPU then updates (i.e., modifies)
the word, inconsistency between the copy in the cache and the copy
in main memory could develop. This depends on the store algorithm
used.


If a store-through algorithm (in which the cache and main
memory are updated simultaneously) is used, inconsistency will not
arise. The price paid for that is a decrease in processing speed as
store operations become limited by the speed of main memory. When
this price is too high, store-behind or store-replacement algorithms
may be used. In both cases main memory is not updated immediately,
and as a result inconsistency arises. The modified word in the
cache will, for "some" interval of time, be different from its
unmodified version in main memory.
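
The difference between the two kinds of store algorithm can be sketched as below. The class `SingleCache`, its methods, and the word-level (rather than line-level) granularity are assumptions made for illustration only; the sketch models store-through and store-replacement and omits store-behind.

```python
class SingleCache:
    """One cache in front of main memory, with a selectable store algorithm."""

    def __init__(self, memory, store_through):
        self.memory = memory              # dict: address -> value (main memory)
        self.lines = {}                   # dict: address -> value (cached copies)
        self.store_through = store_through

    def write(self, address, value):
        self.lines[address] = value
        if self.store_through:
            # Store-through: main memory is updated together with the cache.
            self.memory[address] = value

    def replace(self, address):
        # Store-replacement: main memory is updated only when the word
        # leaves the cache.
        self.memory[address] = self.lines.pop(address)


def inconsistent(cache, address):
    """True while the cached copy differs from the main-memory copy."""
    return address in cache.lines and cache.lines[address] != cache.memory[address]


through = SingleCache({0xA: 1}, store_through=True)
through.write(0xA, 2)

behind = SingleCache({0xA: 1}, store_through=False)
behind.write(0xA, 2)
print(inconsistent(through, 0xA), inconsistent(behind, 0xA))
```

With store-through the two copies never differ; with store-replacement they differ until the word is replaced.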

The more interesting case, however, is that of multicache
systems. Consider the two-cache organization of Figure 1(b). What
is important to emphasize here is that the store-through algorithm
is no longer sufficient to avoid inconsistency. Assume, for
example, that a word whose main memory address is (A), and whose
current value is (V), is present in both cache1 and cache2. CPU1 then
modifies the value of the word in its cache1 to (V'), and assuming a
store-through algorithm is used, main memory is simultaneously
updated. However, cache2 continues to have the unmodified version
(V). If, before this inconsistency is resolved by either updating,
replacing, or invalidating cache2's copy, CPU2 attempts to access
the word (A), it will get what is now an invalid value (V) from its
cache2.
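
The two-cache example can be replayed in a few lines. The dictionaries and helper names below are hypothetical and stand in for the caches and main memory of Figure 1(b); the point is only that store-through keeps main memory current while leaving the second cache untouched.

```python
memory = {"A": "V"}
cache1 = {"A": "V"}   # word (A) is present in both caches
cache2 = {"A": "V"}


def store_through(cache, address, value):
    """Update the writer's cache and main memory at the same time."""
    cache[address] = value
    memory[address] = value


def read(cache, address):
    """Serve the request from the cache if the word is there."""
    if address not in cache:
        cache[address] = memory[address]
    return cache[address]


store_through(cache1, "A", "V'")   # CPU1 modifies the word
stale = read(cache2, "A")          # CPU2 reads it from its own cache
print(memory["A"], stale)
```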

It is time now to adopt what we believe is a more useful
definition of what a "consistency problem" is. We claim that
inconsistency per se is not necessarily a problem. Reconsider the
case of a single-cache organization. We have already explained how
inconsistency can arise for "some" interval of time when the
store-through algorithm is not used. During that interval of time the
cache will contain the modified version of the word, while main
memory will not. If, during this interval, the CPU needs to
re-access the word for processing, what will happen? It will check
the cache, find the word in it, and, therefore, access it, i.e., no
transfer from main memory will be needed. Thus, although
inconsistency exists, no problem arises because the CPU will always
access the updated version of the word from the cache.

With  this  in  mind,  we  now  adopt  the  following  definition  of  a 
"consistency  problem"  (Censier  and  Feautrier,  1978): 

"A  consistency  problem  exists  in  a  cache-based  system  if  the 
value  accessed  by  a  CPU  is  not  the  value  given  by  the  latest 
store  operation  (by  any  CPU)  to  the  same  address." 

It is obvious, from the above discussion, that there will be no
consistency problem for single-cache organizations. These,
unfortunately, are not very attractive for high performance systems.

In the next part of this paper we present a brief discussion of
INFOPLEX, a highly parallel multi-processor computer system that
utilizes a multicache organization. In such an organization the
consistency problem is a significant one. The INFOPLEX organization
will constitute the context within which we shall evaluate the
different approaches for handling the multicache-consistency problem.


II. INFOPLEX: A MULTI-PROCESSOR COMPUTER SYSTEM


A research project is currently underway at the MIT Sloan School
of Management aimed at investigating the architecture of a new data
base computer, called INFOPLEX, which is particularly suitable for
large-scale information management (Madnick, 1975). The specific
objectives of the project include providing substantial performance
improvements over conventional architectures, supporting very large
complex data bases, and providing extremely high reliability.

To  provide  a  high  performance,  highly  reliable,  and  large  capacity 
storage  system,  INFOPLEX  makes  use  of  an  automatically  managed  memory 
hierarchy.  It  is  this  aspect  of  the  project  that  will  be  of  relevance 
in  our  present  discussion. 

As a simplistic illustration, we show in Figure 2 three levels
(only) of the memory hierarchy. As can be seen, this is a multicache
organization. The proposed number of caches (m) is relatively large
compared to present day systems (e.g., m ≈ 32). For the INFOPLEX
objectives such a large number of caches is essential. It will, for
example, help attain the high performance improvements sought (up to a
1000-fold increase in throughput over conventional architectures). In
addition, it allows for the implementation of such features as dynamic
reconfiguration and automatic recovery which are aimed at improving the
reliability of the system.

Two important "boxes" in the design of Figure 2 are the Storage
Level Controller (SLC) and the Memory Request Processor (MRP). The
function of the SLC is to couple the local bus of a storage level to
the global bus that connects all storage levels. In essence, the SLC
serves as a gateway between levels. For example, the SLC of level (1)
accepts requests to the lower storage levels from the caches and
forwards them to the SLC of level (2). When the responses to these
requests are ready, the level (1) SLC accepts them and sends them back
to the appropriate caches.

The Memory Request Processor (MRP) performs such functions as:
implementing the storage management algorithms (e.g., directing the
transfer of information across a storage level); handling all the
communication protocols that are peculiar to the particular storage
modules (devices) at a storage level; and mapping virtual addresses
into their real equivalents. (Note that an MRP is not needed at the
cache level.)

The INFOPLEX organization described (briefly) above will
constitute the context within which we shall study the different
approaches for handling the multicache-consistency problem. For
further information on INFOPLEX the interested reader can consult the
following references: (Madnick, 1975; Madnick, 1979; Lam, 1979; Lam
and Madnick, 1979; and Hsu, 1980).


III. THREE COMMON APPROACHES FOR SOLVING THE CACHE-CONSISTENCY PROBLEM


In Part (I) we explained how a consistency problem could arise in
multicache memory systems. We saw, for example, that in the two-cache
organization of Figure 1(b) a word (A) that existed in both cache1 and
cache2 could be modified to (V') in cache1 and in main memory but not
in cache2, thus giving rise to an inconsistent state. This
example, although rather simple, is quite adequate to demonstrate the
motivations behind the basic strategies that have been used to handle
the consistency problem in multicache systems. There are two such
strategies. First, we could restrict the "encacheability" of data
items, such that only those data items that cannot cause
inconsistencies are allowed to move into the cache level. For example,
words that can only be READ would be encacheable. On the other hand,
all data items that could potentially cause inconsistencies are
prohibited from moving into the cache level, and thus all accesses to
them are done through main memory. Word (A) of the above example
would, therefore, fall in this category, and so it (under this
strategy) would have been prohibited from moving into either cache1 or
cache2. Thus accesses to word (A) by both CPU1 and CPU2 would have
been made to its single copy in main memory, and no inconsistency would
have resulted. The price paid, however, is that accesses to word (A)
are now done at main memory speed and not at the faster cache speed.
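
A minimal sketch of this first strategy follows. The set `read_only`, the word named "R", and the function names are hypothetical; the sketch only shows that a word which is never allowed into a cache has a single copy and therefore cannot become stale.

```python
memory = {"A": "V", "R": "const"}
read_only = {"R"}                      # hypothetical: words that no CPU may write
caches = {"CPU1": {}, "CPU2": {}}


def encacheable(address):
    """Only words that cannot cause inconsistencies may enter a cache."""
    return address in read_only


def read(cpu, address):
    if not encacheable(address):
        return memory[address]         # served at main-memory speed
    cache = caches[cpu]
    if address not in cache:
        cache[address] = memory[address]
    return cache[address]


def write(cpu, address, value):
    if encacheable(address):
        raise PermissionError("read-only word")
    memory[address] = value            # single copy, so nothing can go stale


write("CPU1", "A", "V'")
print(read("CPU2", "A"), read("CPU2", "R"))
```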

The second basic strategy that has been employed doesn't put any
such restrictions on moving data items into the cache level. The idea
here is to "invalidate" a cache line when there is a risk that its
contents have been modified elsewhere in the system. When a cache line
is invalidated (by setting, for example, a flag in the cache directory)
it is considered not in the cache. Referring again to the above
example, when the value of word (A) is modified in cache1 (and in main
memory) to (V'), word (A) in cache2 is invalidated. Thus if CPU2
happens to request word (A) at a later time, it will have to access it
from main memory since the invalidated version (V) in its cache is
considered not to exist. CPU2 will, therefore, access the valid value
(V').


In the remainder of this section we will present more specific
approaches to handle the consistency problem in cache-based systems.
In particular, three approaches that are proposed in the literature
will be discussed, namely, the "Broadcasting," the "Store-Controller,"
and the "Multics" approaches. The "Broadcasting" and
"Store-Controller" approaches are both based on the second strategy
discussed above, but implement it differently. The
"Multics" approach, on the other hand, is based on the first strategy.


III.1. The "Broadcasting" Approach

The idea here (as mentioned in the second strategy above) is to
invalidate a cache line when its contents are modified in another cache
in the system. When a cache line is modified its address is
broadcast throughout the system so that other caches sharing the line
invalidate their now outdated version of it.

Every cache is connected to an auxiliary data path over which all
other caches send the addresses of lines to be modified. Each cache
constantly monitors this path and executes a searching algorithm on all
addresses thus received. In case of a "hit," the affected line is
invalidated.

When a CPU needs to read (or write) a word that doesn't exist in
its own cache, the word will be fetched from main memory. To ensure
consistency main memory must always be kept "up-to-date." This (in
general) can be guaranteed only if a store-through algorithm (in which
the cache and main memory are updated simultaneously) is used. As was
argued before, such a restriction is not without its cost: a decrease
in processing speed as store operations become limited by the speed of
main memory.
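
The broadcasting scheme described above can be sketched as follows. The class `BroadcastSystem` and the `searches` counter are assumptions introduced for illustration; the auxiliary data path is modelled simply as a loop over the other caches, each of which performs one directory search per broadcast address.

```python
class BroadcastSystem:
    """Store-through caches that invalidate on every broadcast write address."""

    def __init__(self, n_caches, memory):
        self.memory = memory
        self.caches = [{} for _ in range(n_caches)]
        self.searches = 0     # directory searches caused by broadcasts

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:
            # A miss is served from main memory, which store-through keeps current.
            cache[address] = self.memory[address]
        return cache[address]

    def write(self, cpu, address, value):
        self.caches[cpu][address] = value
        self.memory[address] = value              # store-through
        for other, cache in enumerate(self.caches):
            if other != cpu:
                self.searches += 1                # every other cache searches its directory
                cache.pop(address, None)          # a "hit" invalidates the line


system = BroadcastSystem(3, {"A": "V"})
system.read(0, "A")
system.read(1, "A")
system.write(0, "A", "V'")
print(system.read(1, "A"), system.searches)
```

Note that the third cache never held the word, yet it still had to search its directory; this is the unproductive work that grows with the number of caches.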

Another major drawback of this approach is that the invalidation
data path must accommodate very heavy traffic. The mean write rate for
most processor architectures lies in the range between 10 and 30 per
cent (Censier and Feautrier, 1978), and thus if the number of
processors is higher than two, the productive traffic between a cache
and its associated processor may be lower than the parasitic traffic
between the cache and all other caches. This explains why this
approach has been confined to systems with at most two caches (Censier
and Feautrier, 1978).


III.2. The "Store-Controller" Approach

The basic idea here is the same as in the "Broadcasting" approach,
i.e., to invalidate a cache line when its contents have been modified
elsewhere in the system. It is in the implementation that the two
approaches differ. Here, a "Store-Controller" SC (see Figure 3(a)) is
used at the cache level to keep track of every line in every cache.
The store-controller "knows," not only which lines are in which cache,
but also which caches share any single line. When, therefore, a line
that is shared by two or more caches is updated (i.e., modified) in one
of them, we do not now need to broadcast invalidation requests to all
caches. We, instead, use the information in the store-controller to
send invalidation requests to only those caches that are sharing the
updated line, if any. In other words, the motivation behind using the
store-controller is to filter out all unnecessary invalidation
requests.

We will present, in a flowcharted form, an example implementation
of this approach which is largely based on Tang's proposals (Tang,
1976). In this implementation, if a processor wants to write and the
line is not found in its cache, then the line is always brought to the
cache so that the processor can always write to cache. The store
algorithm used is the "store-replacement" algorithm. This means that
when a line is modified in a cache, main memory is not concurrently
updated; it is updated later when the line has to be replaced in the
cache or when other caches need to have the line.

Figure 3(a) shows how the store-controller fits conceptually into
the cache organization. Figures 3(b) and 3(c) show possible layouts
for the directories of both the cache and the store-controller. The
"status" column of the cache directory shown in Figure 3(b) needs some
explanation. The status of a line can be one of three things:

1.  Private:  For  a  line  which  has  been  modified  (with  respect  to 
main  memory)  or  is  going  to  be  modified.  A  private  line  exists  in 
only  one  cache. 

2. Non-Private: For a line that exists in one or more caches and
which has not been modified with respect to main memory.

3.  Invalid:  For  a  line  that  has  been  modified  elsewhere  and  thus 
becomes  outdated.  An  invalid  line  is  considered  "not  in  the 
cache." 

In  Figures  4  and  5  the  READ  and  WRITE  operations  are  illustrated 
respectively. 
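
Since the flowcharts of Figures 4 and 5 are not reproduced here, the following is only a sketch of READ and WRITE behavior that is consistent with the description above; it is not a transcription of those figures. The class `StoreController`, the `holders` directory, and the `invalidations` counter are assumed names, and line replacement is not modelled.

```python
PRIVATE, NON_PRIVATE, INVALID = "private", "non-private", "invalid"


class StoreController:
    """Central directory that tracks which caches hold which lines."""

    def __init__(self, n_caches, memory):
        self.memory = memory
        self.caches = [{} for _ in range(n_caches)]   # address -> [status, value]
        self.holders = {}                             # address -> set of cache ids
        self.invalidations = 0                        # invalidation requests sent

    def _valid(self, cpu, address):
        entry = self.caches[cpu].get(address)
        return entry is not None and entry[0] != INVALID

    def _write_back_private(self, address, requester):
        # If another cache holds the line as private, main memory is out of
        # date: update it before the requester fetches the line.
        for holder in self.holders.get(address, set()):
            entry = self.caches[holder][address]
            if holder != requester and entry[0] == PRIVATE:
                self.memory[address] = entry[1]
                entry[0] = NON_PRIVATE

    def read(self, cpu, address):
        if not self._valid(cpu, address):
            self._write_back_private(address, cpu)
            self.caches[cpu][address] = [NON_PRIVATE, self.memory[address]]
            self.holders.setdefault(address, set()).add(cpu)
        return self.caches[cpu][address][1]

    def write(self, cpu, address, value):
        if not self._valid(cpu, address):
            self.read(cpu, address)                   # the line is always brought in first
        for holder in self.holders[address] - {cpu}:
            self.caches[holder][address][0] = INVALID  # only actual sharers are notified
            self.invalidations += 1
        self.holders[address] = {cpu}
        self.caches[cpu][address] = [PRIVATE, value]  # main memory is not updated now


sc = StoreController(3, {"A": "V"})
sc.read(0, "A")
sc.read(1, "A")
sc.write(0, "A", "V'")
print(sc.invalidations, sc.memory["A"])
```

In this run only one invalidation request is sent, to the one cache that actually shares the line, and main memory keeps the old value until another cache asks for the line.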

The two major drawbacks of implementing this approach in a highly
parallel multi-processor computer system such as INFOPLEX, where the
number of caches is relatively large, are:

1. The size of the central directory becomes too large, which
means increased time in processing it and increased costs in
building it.

2. The store-controller could become a bottleneck in the system
as the traffic between itself and the caches becomes very large.


III.3. The "MULTICS" Approach

This approach, which is used in Honeywell's Multics computer
system (Greenberg, 1978), has two important features. Firstly, the
Multics cache is a "store-through" cache. This means that the cache
and main memory are updated simultaneously. The second feature is that
the system address space is divided into segments, each of which has
associated with it names and per-user access rights which govern the
ability of each potential user of the segment to read and/or write its
contents (i.e., words).

Every segment is known by the system to be either "writable" or
"non-writable." The non-writable class of segments, that is those to
which no users have write access, is an important and statistically
significant one in MULTICS. All procedures, including all parts of the
operating system, utilities, libraries, translators, and so forth, fall
into this category. These segments are "encacheable" by all
processors. This means that their contents (or words) are allowed to
"migrate" to any or all caches. Notice that when such words find their
way into a cache (or more than one cache) there is no possibility for a
consistency problem to arise (for such words) since no processor can
write or modify them.

For the other class of segments, those which are writable, the
situation is slightly more complicated. For a segment falling in this
category, there are three possible states:

1.  One  or  more  processes  (users)  are  accessing  the  segment  but 
none  of  them  has  a  write  access  to  it.  The  segment  is  encacheable 
by  all  processors  but  no  consistency  problem  will  arise. 

2. One or more processes are accessing the segment, with at least
one of them having write access. The segment becomes
non-encacheable, i.e., its words cannot migrate to any of the
caches. The consistency problem will not arise here either since
there will only be one copy (i.e., the one in main memory).

3. Only one process that has write access is accessing the
segment. The segment is encacheable only to that process. And it
is only in this third state that the consistency problem could
possibly arise. Consider the following scenario:

Assume a certain segment S1 is addressable by only one
process PROC1 which is currently running for the first time
on processor CPU1. Assume also that PROC1 has write access
to S1. Thus S1 is encacheable by CPU1. As long as CPU1 runs
PROC1, words of S1 may be drawn into CPU1's cache and be
modified by CPU1 with no problem. During this period, no
other processor can address S1, for by assumption it is
addressable only by PROC1, which is uniquely associated with
CPU1 during the interval in question. Thus, it is impossible
for other processors to draw words of S1 into their caches as
long as PROC1 is associated with CPU1. Until CPU1 leaves
PROC1, there is thus no danger of words of S1 in CPU1's cache
becoming outdated, as no other processor can address S1.
Similarly, there cannot be words of S1 in any other
processor's cache, for by assumption, CPU1 was the first and
only processor to run PROC1. Thus, there is no danger that
modifications to words of S1 made by CPU1 can invalidate
copies in other processors' caches, since such copies cannot
exist. Potential difficulty arises when CPU1 has left PROC1,
and some other processor attempts to run PROC1. The first
time this happens, there is no problem. Since all words in
main memory are accurate, by virtue of the store-through
cache, another processor, say CPU2, cannot have inaccurate
data, for main memory is accurate, and we have just shown how
CPU2's cache may not contain inaccurate data. However, while
PROC1 runs on CPU2, CPU2 may modify words of S1 in its own
cache and in main memory. Still there is no problem. Main
memory is accurate, as is CPU2's cache. This can go on like
this as long as processors which have never run PROC1 (since
PROC1 started running) run it. However, the first time some
processor which has already run PROC1 since then attempts to
run it again, the scheme appears to break down. Assume CPU1
attempts to run PROC1 for the second time. There may be
words of S1 in CPU1's cache from the previous time CPU1 ran
PROC1. Some of these words may have been modified by PROC1
while it ran on CPU2. Thus, these words are accurate in main
memory and in CPU2's cache, but are inaccurate in CPU1's
cache, for CPU2 had no way of knowing or acting upon the fact
that they were in CPU1's cache.

The MULTICS solution to "its" consistency problem is simple:
clear the cache of a processor upon entering a process if it was not
the last processor to run that process. This is performed by the
MULTICS process dispatcher, with a special processor instruction that
accomplishes this task. This ensures that no words of any per-process
writable segment will be found in a processor's cache if there was any
possibility that any of those segments may have been modified by other
processors. The operating system maintains, in the control block
describing each process, the identity of the last processor to have run
this process; thus this check is easy to make when a processor is
dispatched into a process.
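
The scenario and the dispatcher rule can be replayed with a small sketch. All names here (`Processor`, `Process`, `dispatch`, `scenario`, and the word "S1.word") are hypothetical, and the `clear_cache` switch exists only so that the same run can be shown with and without the rule; it is not a feature of MULTICS.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.cache = {}                  # address -> value


class Process:
    def __init__(self, name):
        self.name = name
        self.last_processor = None       # kept in the process control block


memory = {}


def dispatch(processor, process, clear_cache=True):
    """Clear the cache unless this processor was the last to run the process."""
    if clear_cache and process.last_processor is not processor:
        processor.cache.clear()
    process.last_processor = processor


def read(processor, address):
    if address not in processor.cache:
        processor.cache[address] = memory[address]
    return processor.cache[address]


def write(processor, address, value):
    processor.cache[address] = value
    memory[address] = value              # store-through


def scenario(clear_cache):
    """PROC1 runs on CPU1, then on CPU2 (which writes), then on CPU1 again."""
    memory.clear()
    memory["S1.word"] = 0
    cpu1, cpu2, proc1 = Processor("CPU1"), Processor("CPU2"), Process("PROC1")
    dispatch(cpu1, proc1, clear_cache)
    read(cpu1, "S1.word")                # word drawn into CPU1's cache
    dispatch(cpu2, proc1, clear_cache)
    write(cpu2, "S1.word", 1)            # modified while running on CPU2
    dispatch(cpu1, proc1, clear_cache)
    return read(cpu1, "S1.word")


print(scenario(clear_cache=False), scenario(clear_cache=True))
```

Without the rule CPU1 reads the outdated word from its own cache; with the rule the cache is cleared on dispatch and the current value comes from main memory.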

From an INFOPLEX-type-system viewpoint, there are three major
drawbacks to the MULTICS approach, all of which are performance
related:


1. The class of segments that are non-encacheable can be of
significant size, thus dampening the performance gains sought
by the cache organization. Note that in MULTICS there is the
significant class of segments which we termed "non-writable" and
which are encacheable. In data base computers, like INFOPLEX,
this category, which contains things like utilities, libraries, and
translators, will probably be much smaller, thus decreasing the
portion of encacheable segments. (There could, however, be
special applications where this doesn't apply, e.g., READ-only data
base applications.) In addition, the MULTICS approach doesn't
discriminate between two processes that both have write
access to a segment, where one actually exercises the "right" and
modifies the segment while the other doesn't. In both cases the
segment will become non-encacheable if there are other processes
that are also reading it. This means that in both cases the
system's speed will be slowed down to the speed of main memory.

2. Using a store-through algorithm has its own performance
disadvantages. As was stated earlier, it limits the speed of
store operations to that of main memory, thus defeating the
very purpose of using a cache.

3. The procedure of clearing the cache is also a wasteful one.
Note that a cache is always cleared if its processor wasn't the
last one to run the process. All access requests to the cleared
cache contents that would have otherwise been serviced by the
cache must now wait for transfers from main memory. This will,
obviously, slow down the system.


III.4. Conclusion

We have analyzed the three common approaches that have been
proposed in the literature to handle the consistency problem in
multicache systems. We have considered each approach in the context of
the INFOPLEX framework presented earlier. Within this context we were
able to identify some major drawbacks in each of the approaches.


In the next section we propose a new approach to handle the
cache-consistency problem in multi-processor architectures. In Part
(V) we will evaluate the performance of this proposed approach.


IV. THE "COMMON-CACHE / PENDED TRANSACTION BUS" APPROACH


IV.1. Introduction

We argued in Part (I) that in a single-cache
memory system (Figure 6(a)), where there is only one access path
between each level, no cache-consistency problem will arise. Once two
or more caches are used, however, the potential for the problem
develops.

The "traditional" approach in employing caches in multi-processor
systems has been to basically replicate the structure of Figure 6(a)
for each of the processors, as shown in Figure 6(b), and then solve any
problems that arise. A problem that arises, of course, is the
cache-consistency problem, and the three basic approaches that have
been developed to handle it are those of Part (III).

Implementing any of the three approaches will obviously incur
some processing overhead. As an example, consider the
"Store-Controller" approach and refer in particular to the flow-chart
of Figure 5. When a CPU needs to modify a line that exists in its
cache, and the line happens to be a "non-private" line, then the status
of the line is first changed to "private" and the "modified flag" is
set. Next, the Store-Controller's directory is checked, and if the
line is found not to be shared by other caches it is modified. If,
however, it happens to be shared, then messages are sent to the
appropriate caches to invalidate the line, the central directory is
updated, and finally the line is modified. How much overhead did we
incur to maintain consistency? Well, compare the above steps with
those needed for a uniprocessor system where the cache-consistency
problem does not arise. In such a system, when the CPU needs to modify
a line that exists in its cache, it simply proceeds and modifies it.
None of the above checks, updates and messages are needed.

The  concern  over  the  processing  overhead  needed  to  maintain 
consistency  is  a  legitimate  one.  Such  overhead  does  undoubtedly  dampen 
the  performance  gains  (e.g.,  system  throughput)  which  are  sought  by 
introducing  caches  in  the  first  place.  Attempting  to  minimize  this 
overhead,  therefore,  seems  an  attractive  direction  for  research  work. 

In  this  research  endeavor,  we  are  basically  proposing  an 
architecture  that  eliminates  the  processing  overhead  associated  with 
handling  the  multicache-consistency  problem.  The  architecture  is 
depicted  in  Figure  6(c) .  The  important  distinction  between  this 
organization  and  that  of  Figure  6(b)  is  that,  here,  each  of  the  cache 
modules  is  accessed  by,  and  thus  services,  all  of  the  processors. 

Thus,  just  as  main  memory  is  common  to  and  shared  by  all  processors, 
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the  cache  level  in  our  proposed  arcliitecture  is  also  cannon  to  and 
shared  by  all  processors.  In  such  a  scheme  there  is  no  need  to  store 
more  than  a  single  copy  of  any  data  item  at  the  cache  level.  This 
eliminates  redundancy.  And  with  no  redundancy  at  the  cache  level,  no 
inconsistencies  can  obviously  arise,  which  in  turn  eliminates  the  need 
for  mechanisms  to  maintain  cache-consistency  and  the  overhead 
associated with such mechanisms.

It is important to note that this architecture preserves the basic
intent behind cache-based systems, namely, using a high speed memory
level between the CPU and main memory in order to bridge the speed gap
between the two. It does, however, raise the concern of whether the
CPU/cache bus (see Figure 6(c)) can handle the needed high volume of
traffic. Comparing the Private Cache (PC) architecture of Figure 6(b)
against the Common Cache (CC) architecture of Figure 6(c), we note that
the CPU/cache bus in PC handles the transaction load generated by a single
processor, whereas in CC it handles the load generated by all the CPUs.
We can, therefore, expect the load on the CPU/cache bus in our
proposed (CC) architecture to be close to N times that of the PC
architecture, where N is the number of processors. Of course, the load
will be somewhat less than exactly N times because in PC some overhead
traffic will be generated by the mechanism used to handle the
cache-consistency problem, which will not be needed in CC.

Thus, to restate, the Common Cache approach eliminates the
cache-consistency problem (i.e., by eliminating private caches), but
there is a fundamental concern that the CPU/cache bus will develop into
a bottleneck that degrades the system's performance. In the next
section we introduce the Pended Transaction Bus Protocol, which we
believe provides the basis for a viable solution to the problem.


IV.2 The Pended Transaction Bus (PTB)

The  degree  of  bus  utilization  of  a  processor  is  a  function  of  the 
physical  characteristics  of  the  bus,  and  the  protocol  used  on  it.  The 
physical characteristics of the bus, which include the length, voltage
levels, impedance, termination, capacitance, noise immunity, and
overall  reflection  characteristics,  affect  its  operating  speed. 
Ultimately, any bus is limited by the speed of electricity along a
conductor (0.6 to 0.9 feet per nanosecond typically, depending on wire
characteristics). Careful electrical analysis and physical layout can
optimize  these  parameters  to  achieve  reasonable  electrical  speed. 

Given  an  electrical  bandwidth  of  the  bus,  as  formed  by  the  bus 
wires and the driving logic, the actual data bandwidth becomes a
function of the protocol used on it. The traditionally high bus
utilizations of multi-microprocessor architectures that employ the
single bus as their interconnection scheme are primarily a result of the
bus protocol used, which is called the "master-slave" protocol. In
such a protocol, the CPU (master) asserts a request on the bus, and the
memory (slave) that receives it performs the appropriate action (e.g., a
READ), and then responds (with the requested data). The bus is viewed
as "busy" during this entire time. A large portion of this time is
actually spent waiting for the slave to complete the requested action.


The  electrical  and  logical  time  to  transmit  the  actual  request  and 
return  the  reply  are  a  relatively  small  portion  of  the  total  bus  usage 
time.  The  period  of  time  between  the  request  and  the  acknowledge  is  a 
wait  interval,  and  no  useful  work  is  done  with  the  bus  during  this 
time.  Bus  utilization  could  be  significantly  reduced  if  we  released 
the  bus  during  the  wait  interval.  To  do  this  we  split  the  transaction 
into  two  parts,  a  request  part  and  a  reply  part.  The  master  requests 
the  bus,  and  upon  being  granted  it,  sends  the  request  to  the  slave  at 
maximum  speed.  The  slave  acknowledges  reception  of  the  request,  and 
starts  to  work  on  it.  The  processor  then  releases  the  bus  and  waits 
for the results. When the slave completes its task, it asks for the
bus, and upon receiving it sends the reply back to the originating
master at maximum speed. The time between the request and reply can be
used by other masters and slaves to transfer other messages over the
bus. Because the slave stores the incomplete request from its master,
it is called a pended transaction, and the bus protocol, developed at
M.I.T., is called a Pended Transaction Bus (PTB) protocol (Toong et
al., 1980).
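
As a rough illustration of the saving (not part of the original study), the sketch below compares how long one transaction occupies the bus under the master-slave protocol and under the pended protocol. The function name and the three timing values are assumptions chosen for illustration only.

```python
def bus_busy_time(t_request, t_service, t_reply, pended):
    """Time (ns) that one transaction keeps the bus occupied."""
    if pended:
        # Pended transaction: the bus is released while the slave works,
        # so only the request and reply transfers occupy it.
        return t_request + t_reply
    # Master-slave: the bus stays "busy" for the whole exchange.
    return t_request + t_service + t_reply


# Illustrative timings only (assumed values, in nanoseconds).
T_REQUEST, T_SERVICE, T_REPLY = 10, 200, 10

master_slave = bus_busy_time(T_REQUEST, T_SERVICE, T_REPLY, pended=False)
ptb = bus_busy_time(T_REQUEST, T_SERVICE, T_REPLY, pended=True)
print(master_slave, ptb)  # 220 20
```

With these assumed numbers the slave's service time dominates, so releasing the bus during the wait interval removes most of the per-transaction bus occupancy.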

In  the  above  discussions,  it  was  presumed  that  the  slaves  were 
always  able  to  accept  the  master  requests  when  presented.  This  would 
imply  that  each  master  was  using  different  slaves  to  guarantee  such 
separation. Given the shared nature of the cache data, it is likely
that two (or more) masters may make requests to the same cache-memory
slave. The simplest scheme to resolve this contention problem is for a
busy slave to refuse a new request and make the requesting master retry
at a later time when the slave becomes free. This, however, is
wasteful of bus bandwidth, since the time spent to send a request out
the first time, only to be refused, is not useful. Furthermore, the
slave  may  still  be  busy  when  the  master  tries  again  later. 

Goodrich (Goodrich II, 1980) has proposed putting queues on the
inputs of the slaves to buffer requests. That is, any requests that
come while the slave is busy would be placed in a first-in-first-out
queue, and these requests would be serviced in order when the slave is
able to handle them. Such a scheme would reduce the bus load to what
is  actually  needed  for  transmission,  without  any  extra  cycles.  In 
addition,  it  would  also  improve  the  slave  response  time  as  seen  by  the 
master  over  the  simple  scheme  described  before,  since  the  slave  would 
have  the  request  in  its  queue,  and  would  service  it  as  fast  as  it 
could,  not  just  when  the  master  is  finally  successful  in  transmitting 
it. Note that a queue overflow need not be fatal in this system. It
can  be  treated  like  a  refusal  in  the  previous  scheme,  and  would  simply 
require  the  master  to  retransmit  the  request  later.  If  the  queue  size 
is  sufficiently  large,  such  refusals  would  be  rare.  Finally,  for  best 
slave  throughput,  there  should  also  be  a  queue  on  the  output  of  each 
slave. 
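
The queueing behaviour described above can be sketched as follows. This is a minimal model under stated assumptions, not the design from the cited work: the class name `QueuedSlave`, its methods, and the capacity value are all hypothetical.

```python
from collections import deque


class QueuedSlave:
    """A slave with a bounded first-in-first-out input queue (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def offer(self, request):
        """Try to enqueue a request; False means the master must retry later."""
        if len(self.pending) >= self.capacity:
            # Queue overflow is treated like a refusal in the simple scheme.
            return False
        self.pending.append(request)
        return True

    def service_next(self):
        """Remove and return the oldest pending request, or None if idle."""
        return self.pending.popleft() if self.pending else None


slave = QueuedSlave(capacity=2)
accepted = [slave.offer(r) for r in ("read A", "read B", "read C")]
print(accepted)              # [True, True, False]
print(slave.service_next())  # read A
```

A larger capacity makes the refusal branch rarer, which is the trade-off the text points to.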


IV.3 An Application: The INFOPLEX Storage Hierarchy

In  the  above  sections  we  have  introduced  an  approach  to  the 
cache-consistency  problem  in  highly  parallel  multi-processor  computer 
systems, which we will call the "Common-Cache / Pended Transaction Bus"
approach (CC/PTB). It has two distinctive features. Firstly, the
conventional one-to-one relationship between processors and caches,
i.e., where each CPU has its own private cache, is replaced by an
N-to-M relationship, where a pool of cache-modules is commonly shared
by all processors. Secondly, we propose the use of the Pended
Transaction Bus (PTB) as the interconnection scheme that connects the
processors and the cache-modules.

In  this  section  we  describe  how  the  CC/PTB  architecture  can  be 
incorporated into the storage hierarchy of the INFOPLEX data base
computer. Schematically the proposed architecture would look like
Figure 7(b). We will, henceforth, refer to this architecture as
INFOPLEX/CC (for Common Cache) while referring to the original Private
Cache organization (shown in Figure 7(a)) as the INFOPLEX/PC
architecture.

The key INFOPLEX storage hierarchy operations are the READ and
WRITE. In INFOPLEX/PC two strategies are needed to implement these
operations, a strategy for the cache level and another for all other
levels. The reason for this is evident in Figure 7(a), where it can
be seen that the organization of the cache level is different from the
other levels of the hierarchy. In INFOPLEX/CC, on the other hand, the
cache level organization is very similar to that of the lower storage
levels. As a result, the strategy used to implement the READ and WRITE
operations could be the same for all levels of the hierarchy with very
minor provisions to account for the few differences that do still
distinguish the cache level (for example, the absence of an MRP at
the cache level).

We will now attempt to explain briefly how the basic READ and
WRITE operations will be implemented in the INFOPLEX/CC architecture.
To a large extent this will be based on the work of Lam (Lam, 1979),
where a much more detailed and complete discussion is presented.

All READ and WRITE operations are performed in the highest storage
level L(1), i.e., the cache level. If a referenced data item is not in
L(1), it is brought up to L(1) from a lower storage level via a
READ-THROUGH operation. The effect of an update to a data item in L(1)
is later propagated down to the lower storage levels via a number of
STORE-BEHIND operations.

When  a  READ  request  is  issued  by  a  processor,  the  cache  level  is 
checked  to  see  if  the  requested  data  is  in  it.  If  the  data  is  found  in 
a  cache-module,  it  is  retrieved  and  returned  to  the  processor.  If, 
however,  the  requested  data  is  not  found  in  the  cache  level,  a 
READ-THROUGH  request  is  queued  to  be  sent  to  the  next  lower  level  L(2) 
via the Storage Level Controller (SLC). As mentioned in Part (II), the
SLC serves as a gateway between the storage levels of the hierarchy.

At a storage level, a READ-THROUGH request is handled by the
Memory Request Processor (MRP). An MRP performs the address mapping
function. It contains a directory of all the data maintained in the
storage level. Using this directory, the MRP can determine if the
requested data is in one of the storage devices at that level. If the
data is not in the storage level, the READ-THROUGH request is queued to
be sent to the next lower storage level via the Storage Level
Controller (SLC).

If the data is found in a storage level L(i), the MRP maps the
main memory address of the requested data item into its real address in
L(i). This real address is used by the appropriate storage device to
retrieve the block containing the data, which it then passes to the SLC.
The SLC would then broadcast the block to all upper storage levels by
dividing it into fixed size packets. Each upper storage level has a
buffer to receive these packets. A storage level only collects those
packets that assemble into a sub-block of an appropriate size (peculiar
to the storage level) that contains the requested data. This sub-block
is then stored in a storage device. At L(1), the sub-block (i.e., the
cache line in this case) containing the requested data is stored, and
the data is finally sent to the processor that initiated the request.
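
The READ-THROUGH flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: each level is modelled as a set of resident block base addresses, packets and queues are omitted, and the function names and the block sizes used in the example are hypothetical.

```python
def block_base(address, block_size):
    """Base address of the block of the given size that contains address."""
    return address - address % block_size


def read_through(levels, block_sizes, address):
    """Search from L(1) downward; on a hit at L(i), copy sub-blocks upward.

    levels[0] models L(1). Each level is a set of resident block base
    addresses. Returns the (1-based) level at which the data was found.
    """
    for i, resident in enumerate(levels):
        if block_base(address, block_sizes[i]) in resident:
            # Every upper level keeps only the sub-block of its own size
            # that contains the requested address.
            for j in range(i):
                levels[j].add(block_base(address, block_sizes[j]))
            return i + 1
    raise LookupError("address not present at any storage level")


# Two levels with assumed block sizes of 64 and 512 bytes.
levels = [set(), {0}]
sizes = [64, 512]
print(read_through(levels, sizes, 200))  # 2  (found in L(2), copied up)
print(read_through(levels, sizes, 200))  # 1  (now served from L(1))
```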

In  a  WRITE  operation,  the  data  item  is  written  into  a 
cache-module,  and  the  processor  is  notified  of  the  completion  of  the 
WRITE operation. We shall assume that the data item to be written is
already in L(1). (This can be realized by reading the data item into
L(1) before the WRITE operation.) A STORE-BEHIND operation is next
generated by the cache-module and sent to the next lower storage level.
INFOPLEX uses a two-level STORE-BEHIND strategy. This strategy ensures
that in a hierarchy with N levels, an updated block will not be
considered for eviction from a storage level L(i) until its "parent"
blocks at levels L(i+1) and L(i+2) are updated. This scheme will
ensure that at least two copies of the updated data exist in the
storage hierarchy at any time. The motivation behind using such a
strategy is two-fold. Firstly, the reliability of the system is
enhanced because at least two copies of newly written data are always
maintained until the data is "securely" copied at the lowest level of
the hierarchy. Furthermore, the STORE-BEHIND strategy allows for the
updating  of  lower  storage  levels  to  be  carried  out  at  slack  periods  of 
system  operation,  thus  enhancing  performance. 
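
The eviction rule of the two-level STORE-BEHIND strategy can be expressed as a small predicate. This is a sketch, not the INFOPLEX implementation: the function name is hypothetical, and the handling of the bottom of the hierarchy (only parent levels that actually exist are required) is an assumption, since the text does not spell out that boundary case.

```python
def evictable(level, updated_levels, n_levels):
    """True if an updated block at L(level) may be considered for eviction.

    updated_levels is the set of levels whose copy of the block has
    already been brought up to date by STORE-BEHIND operations.
    """
    # Parents at L(i+1) and L(i+2); levels beyond the hierarchy are
    # ignored (assumed boundary behaviour).
    parents = [p for p in (level + 1, level + 2) if p <= n_levels]
    return all(p in updated_levels for p in parents)


print(evictable(1, {2}, n_levels=4))     # False: L(3) not yet updated
print(evictable(1, {2, 3}, n_levels=4))  # True
```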

In the above discussions we did not show how a specific
cache-module can be correctly selected to handle a READ or a WRITE
operation of a particular data item. In all but the cache level this
function is performed by the Memory Request Processor (MRP), which maps
main memory addresses into their physical equivalents at a particular
level. What we need, therefore, is to augment the
processors/common-cache interface to translate main memory addresses to
physical addresses in the cache-modules.

There  are  two  possible  places  to  perform  the  translation 
operation:  at  the  processor  interface  and  at  the  cache  interface.  At 
the  processor  interface,  the  address  gets  translated  before  it  reaches 
the  bus.  This  requires  that  each  processor  be  "informed"  about  all 
current  lines  at  the  cache  level.  This  information  would  be  used  to 
translate  all  the  main  memory  addresses  of  these  lines  to  their 
equivalents  at  the  cache  level.  An  advantage  of  this  is  that  the 
translation can be performed while the processor is arbitrating for the
bus,  and  if  the  translation  operation  is  fast  enough,  it  can  be  done 
without  any  access  time  penalty. 

In  the  second  possible  scheme  the  address  gets  translated  at  the 
cache-module  interface.  What  would  be  needed  here  is  a  mechanism 
incorporated  in  the  recognition  circuitry  of  each  cache-module  that 
would translate the main memory address put on the bus, and that would
accordingly decide IF the desired data item is present in the cache
level and if so WHERE, i.e., in which cache-module and in which location
in it. Both these should be done as fast as possible. Tag directory
schemes incorporating set associative mapping are suggested by Matick
(Matick, 1977).
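
To make the IF/WHERE decision concrete, the following sketch models a set-associative tag directory in each cache-module. It is an illustration under assumptions rather than the scheme of the cited reference: the line size, the number of sets and ways, and all names are hypothetical, and the sequential loop over modules stands in for comparisons that hardware would perform in parallel.

```python
LINE_SIZE = 64  # bytes per cache line (assumed value)


class CacheModule:
    """One cache-module with a set-associative tag directory (hypothetical)."""

    def __init__(self, module_id, num_sets, ways):
        self.module_id = module_id
        self.num_sets = num_sets
        self.tags = [[None] * ways for _ in range(num_sets)]

    def _split(self, address):
        """Split a main memory address into (set index, tag)."""
        line = address // LINE_SIZE
        return line % self.num_sets, line // self.num_sets

    def install(self, address, way):
        """Record that the line holding address now occupies the given way."""
        index, tag = self._split(address)
        self.tags[index][way] = tag

    def lookup(self, address):
        """Return (set index, way) on a hit, or None on a miss."""
        index, tag = self._split(address)
        for way, stored in enumerate(self.tags[index]):
            if stored == tag:
                return index, way
        return None


def locate(modules, address):
    """Decide IF the line is cached and WHERE: (module id, set index, way)."""
    for module in modules:
        hit = module.lookup(address)
        if hit is not None:
            return (module.module_id,) + hit
    return None


modules = [CacheModule(m, num_sets=4, ways=2) for m in range(5)]
modules[3].install(0x1240, way=1)
print(locate(modules, 0x1240))  # (3, 1, 1)
print(locate(modules, 0x2000))  # None
```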

In  Part  (V)  we  will  evaluate  the  performance  of  both  these  schemes 
when  implemented  at  the  cache  level  of  the  INFOPLEX  storage  hierarchy. 


V. EVALUATING THE PERFORMANCE OF THE "COMMON-CACHE / PENDED
TRANSACTION BUS" APPROACH


V.1. Introduction

To evaluate the performance of our CC/PTB approach we used the
INFOPLEX multi-level storage system as our test case. Very slight
modifications to the existing design were needed to incorporate
"CC/PTB" into the INFOPLEX storage structure. For example, instead of
using the four level memory hierarchy studied previously by Lam (Lam,
1979b), just two levels, the cache level and main memory, were sufficient
for our purposes.

The "CC/PTB" approach was compared against two benchmarks.

Firstly,  we  evaluated  the  performance  of  the  "Store-Controller" 
approach  as  a  representative  of  the  traditional  approaches.  Selecting 
the  "Store-Controller"  approach  to  evaluate  our  scheme  against  cannot, 
however,  provide  an  effective  answer  to  the  question  of  how  "optimal" 
either approach is. What is needed is an "absolute" benchmark against
which both approaches could be judged. For this purpose we evaluated
the  performance  of  a  traditional  private  cache  architecture  assuming  no 
overhead  for  handling  the  cache-consistency  problem.  This,  then, 
constitutes  the  "performance  ceiling"  that  no  cache-consistency 
handling  mechanism  (for  INFOPLEX)  could  possibly  exceed.  It  is  also  a 
valuable  reference  point  in  that  it  tells  us  how  close  we  are  to  an 
"optimal"  solution. 


V.2.  The  Evaluation  Tool 


The evaluation was performed by producing a simulation model in
GPSS. We developed four separate GPSS programs:

1. Program "OPT" ignores the cache-consistency problem, and thus
incorporates no overhead for handling it. It provides us with a
ceiling on performance.

2. Program "STCR" incorporates the "Store-Controller" approach.

3. Program "CC/PTB/C" incorporates the "CC/PTB" approach with the
translation operation performed at the cache interface.

4. Program "CC/PTB/P" incorporates the "CC/PTB" approach with the
translation operation performed at the processor interface.

The  GPSS  code  for  each  of  the  above  four  programs  is  presented  in 
a separate Appendix (Appendices I through IV). All four programs are
highly detailed simulators. A widely used index that characterizes the
degree of detail of a simulation model is its resolution in time, which
is defined as the shortest interval of simulated time between two
consecutive events being considered (Ferrari, 1978). In our four
simulations, the resolution in time is 10 nanoseconds. Such detailed
simulators are likely to be more accurate and to have a broader field of
application  than  less  detailed  ones.  However,  they  are  certainly  more 
expensive  to  design,  implement,  test,  document,  and  use. 

In  addition  to  deciding  on  the  degree  of  detail  of  the  models, 
another  basic  decision  had  to  be  made  concerning  the  workload  that 
would drive the simulations. We decided to perform our measurements in
an operating environment that is not at all uncommon in present-day
computer systems, namely, running at capacity. Under such conditions,
there will always be at least one transaction in the system's input
queue(s) waiting to be serviced. In such a case, the performance
characteristics that are of interest to us, such as throughput per unit
time, become insensitive to the distribution of job arrivals.

Before concluding this section we would like to emphasize some of
the structural differences between the four models, as well as some of
their common characteristics. The basic structural differences are
highlighted in the diagrams of Figure 8. In Figure 8(a) a
two-storage-level version of the INFOPLEX storage hierarchy proposed in
(Lam, 1979b) is shown. To support the "Store-Controller" approach an
SC "box" (dotted in Figure 8(a)) is added to the architecture. In
Figure 8(b) the architecture we proposed in Part IV to support the two
versions of the "CC/PTB" approach is incorporated into the INFOPLEX
storage system. Notice that in the CC/PTB architecture we deliberately
maintained the same number of caches (5) as in Figure 8(a). It is
important to realize that while a 1:1 relationship between the CPUs and
the caches is inherent in the Private Cache (PC) architectures, such is
not the case for the Common Cache (CC) architectures. We, however,
chose to maintain the same number of caches in both architectures (and
four simulation models) to neutralize it as a factor that might affect
performance.


Finally, there is a set of common characteristics that are shared
by all four models. These are:

Degree of Multiprogramming of a CPU    = 10
Bus Width                              = 8 bytes
Size of Transaction without data       = 8 bytes
Size of Transaction with data:
    at Level 1                         = 8 bytes
    at Level 2                         = 64 bytes
Size of Data Buffers                   = 10 64-byte transactions


Finally, the Pended Transaction Bus (PTB) protocol will be used in
all three architectures (and four models). That is, we are committed
to the PTB protocol in INFOPLEX as a result of our experimental work
(at M.I.T.), which demonstrated its performance advantage.


V.3.  The  Simulation  Experiment: 

Our criterion for measuring performance in this experiment, which
will also serve as our dependent variable, will be the total
system throughput as measured by the total number of transactions
processed per unit of time. As for the independent variable, there
were two candidates: (1) the hit ratio, which is the percentage of
time that a referenced data item is found in the cache level; and (2)
the transaction mix, i.e., the percentage of READ requests versus WRITE
requests.

The latter was chosen because we felt it is, in a sense, a more
independent  variable.  What  we  mean  is  that  the  transaction  mix  is 
largely  a  function  of  the  use  of  the  system  and  as  such  we  have  very 
little  control  over  it.  The  hit  ratio,  on  the  other  hand,  is 
system-dependent.  It  can  be  affected  by  manipulating  such  system 
parameters  as  the  total  size  of  the  cache  level  and  the  size  of  the 
individual  cache  blocks. 

Thus, the hit ratio will be held constant throughout the
experiment, at a value of 0.90 for READ requests and 1.00 for WRITE
requests. (That is, we are assuming that a WRITE of a data item is
always preceded by a READ to it.) The transaction mix, i.e., the
percentage  of  READs,  will  be  allowed  to  vary  in  the  range  from  70  %  to 
90  %.  This  is  the  range  in  which  the  mean  READ  rate  lies  for  most 
processor  architectures  (Censier  and  Feautrier,  1978). 
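
As a small worked illustration (not taken from the simulation programs), the overall cache hit ratio implied by a given transaction mix can be computed as a weighted average. The function name is hypothetical, and the default hit ratios are the values read from the text above (0.90 for READs, 1.00 for WRITEs).

```python
def overall_hit_ratio(read_fraction, read_hit=0.90, write_hit=1.00):
    """Weighted hit ratio for a mix with the given fraction of READs."""
    return read_fraction * read_hit + (1.0 - read_fraction) * write_hit


# The two ends of the 70 % to 90 % READ range studied in the experiment.
for read_fraction in (0.70, 0.90):
    print(read_fraction, round(overall_hit_ratio(read_fraction), 2))
```

Under these assumptions the overall hit ratio moves only between about 0.93 and 0.91 across the range, so the mix mainly changes the share of WRITE traffic rather than the miss rate.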

Both  the  hit  ratio  and  the  transaction  mix  are  variables  that 
affect  the  performance  of  the  system  but  which  are  independent  of  the 
mechanisms used to handle the cache-consistency problem. There are,
however, variables that are peculiar to the particular mechanism used,
and which influence the system's performance significantly.


In the "Store-Controller" approach the degree of sharing between
the caches is by far the most important such variable. We evaluated
the "Store-Controller" approach under two operational modes, a
"pessimistic" mode and an "optimistic" mode. Under the optimistic mode
no sharing between caches takes place, yielding an upper bound on the
performance of this approach. Under the pessimistic mode a high degree
of sharing will be introduced (50 % of the cache blocks will be shared
by more than one cache-module). This will then provide us with a
conservative  lower  bound  on  the  performance  of  the  "Store-Controller" 
approach. 

With respect to the "CC/PTB" approach we will also follow the
above  strategy,  and  evaluate  it  under  both  "pessimistic"  and 
"optimistic"  conditions.  Under  the  optimistic  condition  we  will  assume 
the  load  on  the  cache-modules  to  be  uniformly  distributed  (i.e.,  each 
of the 5 cache-modules carries 20 % of the load). And for the
pessimistic  case  we  will  assume  the  load  to  be  linearly  distributed 
between  10  %  at  the  least  loaded  cache-module  and  30  %  at  the  most 
loaded. 
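
The two load distributions can be generated as follows. This is a sketch under an assumption: "linearly distributed" is read here as equal steps between the least and most loaded module, which for five modules gives shares that sum to 100 %. The function names are hypothetical.

```python
def uniform_load(num_modules):
    """Percentage of the load carried by each module when shared equally."""
    return [100 / num_modules] * num_modules


def linear_load(num_modules, lowest, highest):
    """Percentages rising in equal steps from lowest to highest (assumed shape)."""
    step = (highest - lowest) / (num_modules - 1)
    return [lowest + k * step for k in range(num_modules)]


print(uniform_load(5))         # [20.0, 20.0, 20.0, 20.0, 20.0]
print(linear_load(5, 10, 30))  # [10.0, 15.0, 20.0, 25.0, 30.0]
```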


V.4.  The  Hardware  Parameters 


It  is  necessary  to  determine  the  speeds  of  the  different  hardware 
components in the INFOPLEX storage hierarchy. In particular, what we
sought were the best possible 1985 projections for these speeds, since
the first hardware prototype of the INFOPLEX data base computer was not
expected before then.

The forecasts shown below constitute a realistic, but ambitious,
scenario for 1985. In other words, they incorporate the fastest
possible components that we envision as being available (and
appropriate) for building an INFOPLEX in 1985. (Only a subset of the
parameters are shown.) The bus speed (b) is 10 nanoseconds, the cache
READ/WRITE speed (c) is 20 nanoseconds, and the remaining parameters
are multiples of the latter, as shown. In other words, we used the
cache speed as a logical building block to "build" the forecasts of the
other hardware components (other than the bus). For example, if the
cache READ/WRITE speed is 20 nanoseconds the READ/WRITE speed of main
memory would be 10 x c = 200 nanoseconds. The parameter values are:


Bus speed (b)                                =  10 nanoseconds
Cache READ/WRITE speed (c)                   =  20 nanoseconds
Directory Lookup (2c)                        =  40 nanoseconds
Directory Update (4c)                        =  80 nanoseconds
Storage Level Controller (SLC) speed (2c)    =  40 nanoseconds
Main Memory READ/WRITE speed (10c)           = 200 nanoseconds


Three other scenarios will be tested. The bus speed, in all
three, will remain at 10 nanoseconds. The cache READ/WRITE speed,
however, will take the increasing values of 40, 60, and 80 nanoseconds.
And finally, the remaining parameters will maintain their relative
values in terms of c, the cache READ/WRITE speed. For example, when
the cache READ/WRITE speed (c) becomes 40 nanoseconds the READ/WRITE
speed of main memory will be 10 x c = 400 nanoseconds.
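
The way each scenario's parameters follow from the cache speed can be written out directly. The sketch below simply applies the multipliers listed in the parameter table; the function and dictionary key names are hypothetical.

```python
def hardware_parameters(c, b=10):
    """Component speeds (ns) derived from cache speed c and bus speed b."""
    return {
        "bus": b,
        "cache_read_write": c,
        "directory_lookup": 2 * c,
        "directory_update": 4 * c,
        "storage_level_controller": 2 * c,
        "main_memory_read_write": 10 * c,
    }


# The four scenarios: cache speeds of 20, 40, 60 and 80 nanoseconds.
for c in (20, 40, 60, 80):
    params = hardware_parameters(c)
    print(c, params["directory_update"], params["main_memory_read_write"])
```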


Selecting several "good" 1985 scenarios reflects the fact that
different options, in building an INFOPLEX, will be available to
accommodate the cost/performance tradeoffs. And because the bulk of the
system's cost lies largely in the storage components, the variations in
the forecasts between the different scenarios involved mainly those
components.


V.5. Analysis of the Simulation Results:

As  mentioned  previously,  our  dependent  variable  in  this  experiment 
is system throughput. It is also the criterion we use to evaluate the
performance of the cache-consistency handling mechanisms.

The throughput of any computer system is bounded by one of two
factors, namely, bottlenecks in the system or the transaction load on
it. In Section V.2 we mentioned that our models will operate in a
maximum load closed-loop environment, where new transactions are
continuously  generated  to  replace  serviced  ones.  In  such  an 
environment,  bottlenecks  will  definitely  arise,  limiting  the  throughput 
of  the  system. 

Thus, in the process of interpreting the simulation results, we
wish to identify and analyze system bottlenecks. In such an analysis,
one needs to consider four major factors that directly influence the
evolution of a particular system component into becoming the system's
bottleneck. The four factors are:

1.  Architecture: Consider for example the (a) PC and (b) CC
architectures of Figure 8. In PC the local bus of level 1 (the
cache level) handles only the communications between the five
caches and main memory. In the "CC/PTB" architecture the cache
level bus must handle, in addition, the communications between all
the processors and the cache-modules. The chances for the local
bus of level 1 (LBUS1) to become the system bottleneck are,
therefore, much higher in the CC architecture than they are in PC.

2.  Algorithms  and  Protocols:  The  set  of  algorithms  supported  by 
an  architecture  and  the  protocols  and  mechanisms  used  to  implement 
them  undoubtedly  influence  the  utilization  patterns  of  the 
different  architectural  components.  Consider  for  example  the 
central  role  of  the  Store-Controller  "box"  in  implementing  the 
algorithms that support the READ and WRITE operations in the
"Store-Controller" approach. Such a role will inevitably lead to
high  levels  of  utilization  of  the  Store-Controller. 

3.  Workload: We mentioned in Section V.3 that we intend to
evaluate  the  "CC/PTB"  approach  using  two  different  load 
distributions  on  the  cache-modules,  a  uniform  distribution  and  a 
linear  one.  In  the  latter  case,  the  cache-module  carrying  the 
highest  load  (i.e.,  30  %  of  total  load)  could  clearly  develop  into 
a  system  bottleneck. 

4.  Hardware Components' Characteristics: The characteristic of
significance here is speed. For an example we refer again to
Figure 8. All communications between level 1 and level 2 in the
INFOPLEX storage hierarchy go through the Storage Level
Controllers (SLCs) of both levels as well as through the Global
Bus. The time needed by the Global Bus to process any of the
communication messages (i.e., the time it takes to transmit the
message) will usually be less than that needed by the SLC to
process the same message. This means that the SLC will always
saturate before the Global Bus can develop into a bottleneck.

As part of the standard GPSS output, the utilizations of all
hardware components (that are modelled as GPSS "facilities") are
printed. This allows us to precisely identify the system
bottleneck(s). An example is shown in Figure 9. In this case the
bottleneck is LBUS1 (the local bus at level 1) with a utilization of
approximately 100 %.
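
A facility's average utilization is its total busy time divided by the elapsed simulated time, so the bottleneck check can be sketched as below. This is an illustration only: the function names, the 0.95 threshold, and the elapsed time of 10,000 time units are assumptions, while the LBUS1 entry count (6514), its average time per transaction (1.535), and the two utilization values are read from the Figure 9 printout.

```python
def utilization(entries, avg_time_per_entry, elapsed):
    """Fraction of the elapsed simulated time during which a facility was busy."""
    return entries * avg_time_per_entry / elapsed


def find_bottlenecks(utilizations, threshold=0.95):
    """Names of facilities at or above the threshold, busiest first."""
    busy = [(u, name) for name, u in utilizations.items() if u >= threshold]
    return [name for u, name in sorted(busy, reverse=True)]


# LBUS1 figures from the printout; the elapsed time is an assumed value.
print(round(utilization(6514, 1.535, elapsed=10_000), 3))

reported = {"LBUS1": 0.999, "LBUS2": 0.702}
print(find_bottlenecks(reported))  # ['LBUS1']
```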

In  Figure  10  our  simulation  results  for  the  four  scenarios  are 
presented.  In  the  discussion  that  follows,  we  will  identify  the 
different  scenarios  by  their  cache  speed/bus  speed  ratios  (n)  (i.e.,  n 
=  2,  4,  6,  or  8)  as  it  is  a  convenient  parameter  that  completely 
characterizes  each. 

For each technology scenario (i.e., n value) seven curves are
plotted. The single solid (-----) curve portrays the performance of the
"OPT" model, in which no mechanisms for handling the cache-consistency
problem (and, therefore, no overheads) are incorporated. Thus the
throughput of the "OPT" model, as is demonstrated in the figure,
provides a ceiling for all the other models. The "Store-Controller"
model's results are depicted by the two dash-and-dot (-.-.) curves (S1)
and (S2). Curve (S1), which is always the dominating curve, is for the
case where there is no data sharing between the caches, and (S2) is for
the case where 50 % data sharing is introduced. And finally, there are two
curves for each of the two implementations of the "CC/PTB" approach.

The  two  dashed  ( — )  curves  (PI)  and  (P2)  belong  to  the  "OC/PTiyp" 
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facility 

GPjS 
L  Bub  1 
IBUS2 
DR  P  1  1 
DRP  1  2 
0R°t  3 
DRP  1  4 
URPl  5 
KRP  1 

KP?2 

RP?2 

DSP21 


AVERAGE 
UTI LIZATION 
.474 
— ►  .999 
.702 
.396 
.405 
.41  1 
.369 
.398 
.479 
.485 
.237 
.198 


NUMBER 
ENTRIES 
1  ’99 
6514 
1  702 
679 
695 
703 
629 
681 
1  199 
1214 
719 
496 


AVERAGE 
TIME/TRAN 
3.959 
1  .535 
4.126 
5.846 
5.833 
5.849 
5.879 
5.856 
4.000 
4.000 
3.997 
4.000 


SEIZING 
TRANS.  NO 

59 

72 

37 
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Figure  (9) 


preempting 

TRANS.  NO. 
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implementation,  while  the  two  dotted  (...)  ones  (Cl)  and  (C2)  are  for 
"CC/PTB/C."  The  two  dominating  curves  in  both  implementations,  namely 
the  (PI)  and  (Cl)  curves,  are  for  the  case  where  the  load  is  uniformly 
distributed  among  all  cache-modules.  The  two  other  curves  (P2)  and 
(C2)  depict  the  performances  under  the  linearly  distributed  load. 

As Figure 10 demonstrates, the "CC/PTB" architecture, in both its
implementations, completely dominates the "Store-Controller" in three
out of the four technology scenarios. Only in the first scenario (n =
2) does the "Store-Controller" show a performance advantage, and only
for high READ rates. The remaining part of this section will be
devoted to an analysis of the different factors affecting the
performance patterns demonstrated in Figure 10. Of particular help in
conducting this analysis is the information in Figure 11, which depicts
the system bottlenecks in all the cases tested.

There are two basic patterns that deserve separate analysis: the
distinctive pattern of n = 2, and the pattern common to the three
other scenarios, n = 4, 6, and 8. To analyze the latter we will
arbitrarily pick the case of n = 6 as our analysis vehicle.


V.5.1. Case of n = 2

To many, the most surprising aspect of these results will be the
almost identical shapes of the "OPT" curve and the "Store-Controller"
curve (S1). To understand why this is so, we first note from Figure 11
that in both models LBUS2 (i.e., the local bus of level 2) is the
bottleneck. Now, the difference between the "Store-Controller" model
and the "OPT" model lies in the overhead necessary to handle the
cache-consistency problem. All this overhead (in the
"Store-Controller" model) is in the form of additional operations, which
all take place at the cache level. Thus, while removing this overhead
(in the "OPT" model) will necessarily decrease the load on the cache
level, it will not decrease the load on level 2, and in particular on
LBUS2. Since LBUS2 is the bottleneck in both cases, and since
the load on it (by an "average" transaction) remains the same, the
performances in both are very similar. For the same reasons, the other
five models, namely (S2), (P1), (P2), (C1), and (C2), have
performances close to that of "OPT" for % READs <= 80 (i.e., they all
have LBUS2 as the system bottleneck).

The  fact  that  LBUS2  is  the  bottleneck  is  itself,  by  the  way,  an 
interesting  finding.  With  a  hit  ratio  of  0.S0  for  READ  requests  and 
1.00  for  WRITE  requests  most  of  the  "action"  is  clearly  done  at  the 
cache  level.  It  would,  therefore,  seem  that  LBUS1  must  be  saturated 
before  LBUS2.  The  answer  goes  back  to  the  parameter  values  of  section 
V.2. Notice that the transaction size for level 2 (64 bytes) is eight
times as large as that for level 1 (8 bytes). This means that a single
transmission on LBUS2 will take approximately eight times as long
as a single transmission on LBUS1. Thus, although the absolute
number of transmissions on LBUS2 is less than that on LBUS1, the
utilization of LBUS2 in this case is higher.
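
The argument is simple arithmetic: a bus's utilization is the number of
transmissions it carries multiplied by the time each one takes, divided
by the length of the run. The Python sketch below illustrates this
under stated assumptions. Only the 8:1 ratio of transaction sizes comes
from the text; the transmission counts, the unit time, and the run
length are hypothetical values chosen for illustration, not simulation
output.

```python
def bus_utilization(transmissions, time_per_transmission, run_length):
    """Fraction of the run during which the bus is busy."""
    return transmissions * time_per_transmission / run_length


# Only the 8:1 size ratio (64 bytes vs. 8 bytes) comes from the text.
# Everything else below is an illustrative assumption.
T1 = 1.0        # time for one 8-byte transmission on LBUS1
T2 = 8.0 * T1   # a 64-byte transmission on LBUS2 takes about 8x as long
RUN = 10_000.0  # length of the run, in the same time units

u1 = bus_utilization(6000, T1, RUN)  # many short transmissions
u2 = bus_utilization(1000, T2, RUN)  # one sixth as many, each 8x longer
print(u1, u2)
```

Under these assumptions LBUS2 is the busier bus whenever its
transmission count exceeds one eighth of that of LBUS1.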

Another interesting observation relates to curve (S2) of the
"Store-Controller." Curve (S2) is always below curve (S1) because in
(S2) the "Store-Controller" (SC) is the system bottleneck, with a
saturation point lower than that of LBUS2. This happens because
the high degree of data sharing between the caches (which is
"orchestrated" by the Store-Controller) means a higher utilization of
the Store-Controller by the "average" READ/WRITE request. Notice also
that the two curves (S1) and (S2) diverge as the % of READs increases.
This is merely a reflection of the pattern by which the READ and WRITE
operations utilize the two respective bottlenecks, LBUS2 and SC.
(Straightforward analytic calculations would demonstrate that the
effective capacity of LBUS2 increases faster than does that of SC as
the % of READs increases.)

Next let us turn our attention to the "CC/PTB" results. Notice
first the deflections in curves (P1), (P2), (C1), and (C2). These
occur because at approximately 80 % READs LBUS1 (and not LBUS2)
becomes the system bottleneck in the four cases. However, even though
LBUS1 is the bottleneck for both "CC/PTB/P" and "CC/PTB/C", "CC/PTB/P"
clearly dominates. This is simply because an "average" READ/WRITE
request utilizes LBUS1 less often in "CC/PTB/P" than it does in
"CC/PTB/C." For example, in CC/PTB/P when a requested data item is
found by the processor not to be in the cache level, a request is sent
to main memory via LBUS1. In CC/PTB/C, on the other hand, a CPU
request must first go to a cache-module (through LBUS1), only to be
found unavailable, and is then forwarded by the cache-module to main
memory through LBUS1 again.
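
The difference in LBUS1 usage can be illustrated by counting request
transmissions per CPU reference. The Python sketch below is a
simplified model, not the simulation itself. It counts only the request
path described above and ignores responses and consistency traffic. It
also assumes that a hit in CC/PTB/P costs one LBUS1 transmission. The
function name and the hit ratios used are hypothetical.

```python
def lbus1_requests_per_reference(hit_ratio, scheme):
    """Expected LBUS1 request transmissions per CPU reference
    (request path only; a simplifying assumption)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    miss_ratio = 1.0 - hit_ratio
    if scheme == "CC/PTB/P":
        # The processor already knows whether the item is cached, so the
        # request crosses LBUS1 once: to a cache-module on a hit, or
        # toward main memory on a miss.
        return 1.0
    if scheme == "CC/PTB/C":
        # Every request first crosses LBUS1 to a cache-module; on a miss
        # the cache-module forwards it over LBUS1 a second time.
        return 1.0 + miss_ratio
    raise ValueError(f"unknown scheme: {scheme}")


for h in (0.5, 0.8, 0.9):  # illustrative hit ratios
    p = lbus1_requests_per_reference(h, "CC/PTB/P")
    c = lbus1_requests_per_reference(h, "CC/PTB/C")
    print(f"hit ratio {h}: CC/PTB/P {p:.2f}, CC/PTB/C {c:.2f}")
```

In this model the gap between the two implementations equals the miss
ratio, so it vanishes only when every reference hits.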

Notice finally the negligible effect that the load distribution on
the cache-modules has on performance (i.e., curves (C1) and (C2) are
similar, as are curves (P1) and (P2)). The reasons for this are
completely analogous to the ones explaining the close resemblance
between the "OPT" curve and curve (S1). In other words, changing the
load distribution does not affect the utilization of LBUS1. As long as
no new bottleneck develops because of the change in the load
distribution, LBUS1 will remain the bottleneck, and thus maintain
approximately the same throughput. The utilization of the cache-module
that carries the highest load under the linear load distribution for
both "CC/PTB/P" and "CC/PTB/C" turns out never to exceed 60 %, which is
far below the 100 % mark that has to be approached before it would
replace LBUS1 as the system bottleneck. This, in a sense, is very
comforting to know. It shows that the behavior of the models is, to a
large extent, insensitive to the shape of the load distribution on the
cache-modules that we used.
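
This insensitivity follows from a simple bottleneck bound: throughput
is limited by the resource with the largest service demand per
transaction, so redistributing load among the cache-modules changes
nothing until one module's demand overtakes that of the bus. The Python
sketch below illustrates the bound. All numbers are assumptions made
for illustration: five cache-modules, linear shares of 10 to 30 %
(only the 30 % maximum comes from the text), and arbitrary demand
values.

```python
def demands_for(shares, bus_demand, total_module_demand):
    """Per-transaction service demand on the bus and on each module."""
    demands = {"LBUS1": bus_demand}
    for i, share in enumerate(shares, start=1):
        demands[f"module{i}"] = share * total_module_demand
    return demands


def max_throughput(demands):
    """Bottleneck bound: the largest demand limits the throughput."""
    return 1.0 / max(demands.values())


# Hypothetical load shares; only the 30 % maximum is from the text.
uniform = [0.20] * 5
linear = [0.10, 0.15, 0.20, 0.25, 0.30]

# Bus demand 1.0, total module demand 2.0 (illustrative values): the
# busiest module stays below the bus under both distributions.
x_uniform = max_throughput(demands_for(uniform, 1.0, 2.0))
x_linear = max_throughput(demands_for(linear, 1.0, 2.0))
print(x_uniform, x_linear)

# With a heavier module demand the busiest module becomes the bottleneck
# under the linear distribution only.
x_linear_heavy = max_throughput(demands_for(linear, 1.0, 4.0))
x_uniform_heavy = max_throughput(demands_for(uniform, 1.0, 4.0))
print(x_uniform_heavy, x_linear_heavy)
```

In the first pair the two distributions give the same bound because the
bus remains the largest demand. In the second pair only the linear
distribution loses throughput.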


V.5.2. Case of n = 6

Most of the ideas of the above discussion are applicable to the
case of n = 6. For example, "OPT," "CC/PTB/P"'s (P1) and (P2), and
"CC/PTB/C"'s (C1) all have almost identical performances because they
all have the same bottleneck, namely, the Storage Level Controller
(SLC).


Notice, on the other hand, that because the bus is now relatively
faster in comparison to the other system components, and in particular
to the Store-Controller (SC), LBUS2 ceases to be the bottleneck in the
two "Store-Controller" models. Instead, SC is now the bottleneck, and
the degradation in performance is quite evident. However, notice that
even though SC is the bottleneck for both (S1) and (S2), the
throughputs are quite different. The reason for this is that an
"average" READ/WRITE request in (S2) (where there is 50 % data sharing)
requires more "services" from the "Store-Controller" than does an
"average" READ/WRITE request in (S1). (When data sharing exists
between the cache-modules, the SC has, for example, to invalidate
redundant copies when a data item is modified.)

The only remaining result that deserves some explanation is the
deflection exhibited in curve (C2) of "CC/PTB/C" at % READs = 80. The
reason for this behavior is that somewhere between % READs = 80 and
% READs = 85 the most heavily loaded cache-module (i.e., the one
carrying 30 % of the load) replaces the SLC as the system's bottleneck
(see Figure 11). Notice that this does not happen in "CC/PTB/P"'s
curve (P2) even though the same linear load distribution is used. The
reason is that, even though the cache-modules in both
cases are subjected to the same load distribution, they are not
subjected to the same load. This, of course, is because the READ/WRITE
operations in the "CC/PTB/P" implementation use the cache-modules less
often.


V.6.  Conclusion 


The above results clearly indicate that no one approach dominates
over all four technology scenarios. Technology, therefore, must remain
an element of some uncertainty.

It is important to realize, though, that what really matters
in our technology forecasts is not the absolute values of the different
speeds, but rather the relative values for the different hardware
components. The most important such value is the relative speed of
the bus vis-a-vis the processing components that use it. Our results
clearly demonstrate that the faster the bus is relative to everything
else, the more appealing the "CC/PTB" approach becomes. More
specifically, when the bus is four times as fast as the cache or faster
(i.e., n >= 4), the "CC/PTB" approach provides a 20 to 60 % performance
advantage over the "Store-Controller" approach.

But what is perhaps the most interesting finding is the fact
that for three out of the four technology scenarios (those with n >= 4)
the performance of our "CC/PTB" scheme is very close to that of "OPT."

Notice that in the above statements no attempt was made to single
out either of the two implementation schemes of the "CC/PTB"
approach. One of the interesting findings in the simulation results is
that the performances of both schemes are very close indeed. (Our own
intuition was that the "CC/PTB/P" implementation would display a
performance advantage. It was clear that by utilizing "fast"
processors that would overlap address translation with arbitrating for
the bus, the utilizations of the bus and the cache-modules would
decrease. Although the utilizations were indeed lower, this did not
materialize into the higher performance we anticipated. The reason:
the execution of the INFOPLEX storage hierarchy operations and storage
mechanisms was such that the storage level controller (SLC), whose
utilization is independent of the "CC/PTB" scheme used, would, in most
cases, be the first component to saturate. The only exception to this
is when, under the linear load distribution case, the most heavily
utilized cache-module developed into the system's bottleneck. In such
a case the higher performance potential of the "CC/PTB/P" scheme was
indeed realized.)


VI.  CONCLUSION 


This paper is a report on an ongoing research effort at the M.I.T.
Sloan School of Management to study the multicache-consistency problem
in highly parallel multi-processor computer systems.

There are three basic approaches proposed in the literature to
handle the multicache-consistency problem: the "Broadcasting"
approach, the "Store-Controller" approach, and the "Multics" approach.
However, serious drawbacks could be identified in each. A new approach
called "CC/PTB" was, therefore, developed. It attempts to minimize
performance degradation by minimizing the overhead of maintaining
cache-consistency.

The "CC/PTB" approach was implemented in the INFOPLEX storage
hierarchy and evaluated using simulation modeling. The results are
very favorable.

This work opens up many areas for further investigation. No
mention was made in this paper, for example, of the replacement
algorithms at the cache level. It would be interesting to find out the
value of implementing a "sophisticated" algorithm such as the "Least
Recently Used" (LRU) algorithm (or versions of it) as opposed to a
naive algorithm (e.g., random replacement) that requires much less
hardware overhead.

We have assumed, as is common in the literature, that all WRITE
operations for a particular data item are preceded by READ operations.
Relaxing this constraint, and developing efficient algorithms to
exploit both this relaxation and the architecture of the storage
hierarchy, is definitely worth investigating.

And finally, the "CC/PTB" architecture should be exploited in the
development of algorithms that would improve the reliability of the
data storage hierarchy. The automatic data repair algorithms, for
example, are particularly interesting and promising.